Received 22 February 2019; revised 14 April 2019 and 11 May 2019; accepted 29 May 2019.

Date of publication 19 June 2019; date of current version 3 February 2020.

Digital Object Identifier 10.1109/JXCDC.2019.2923745

# A Ferroelectric FET-Based Processing-in-Memory Architecture for DNN Acceleration

YUN LONG (Student Member, IEEE), DAEHYUN KIM, EDWARD LEE (Graduate Student Member, IEEE), PRIYABRATA SAHA (Graduate Student Member, IEEE), BURHAN AHMAD MUDASSAR (Student Member, IEEE), XUEYUAN SHE (Graduate Student Member, IEEE), ASIF ISLAM KHAN (Member, IEEE), AND SAIBAL MUKHOPADHYAY (Fellow, IEEE)

School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA

CORRESPONDING AUTHOR: Y. LONG (yunlong@gatech.edu)

This work was supported in part by the National Science Foundation (NSF) under Grant #1810005.

**ABSTRACT** This paper presents a ferroelectric FET (FeFET)-based processing-in-memory (PIM) architecture to accelerate the inference of deep neural networks (DNNs). We propose a digital in-memory vector-matrix multiplication (VMM) engine design utilizing the FeFET crossbar to enable bit-parallel computation and eliminate analog-to-digital conversion in prior mixed-signal PIM designs. A dedicated hierarchical network-on-chip (H-NoC) is developed for input broadcasting and on-the-fly partial results processing, reducing the data transmission volume and latency. Simulations in 28-nm CMOS technology show 115× and 6.3× higher computing efficiency (GOPs/W) over desktop GPU (Nvidia GTX 1080Ti) and resistive random access memory (ReRAM)-based design, respectively.

**INDEX TERMS** Deep neural network (DNN), ferroelectric FET (FeFET), processing-in-memory (PIM).

#### I. INTRODUCTION

THE WIDESPREAD adoption of deep neural networks (DNNs) in solving complex problems in various domains has inspired the design of many dedicated hardware accelerators [1]–[3]. However, as DNN models become deeper and require more parameters, the time/energy cost of moving data between memory and logic starts to limit computing efficiency. Memory-rich architectures integrate a large amount of on-chip memory to reduce DRAM accesses [2], [3]. Near-memory-processing (NMP) architectures embed logic engines within off-chip memory to reduce the cost of data movement [4], [5]. A more aggressive approach is to perform computation directly inside memory, often referred to as processing-in-memory (PIM) [6]–[11]. PIM architectures are designed to perform vector-matrix multiplication (VMM) within the memory (i.e., VMM-in-memory). The current summation at the bitline is used to perform the multiplication-accumulation (MAC) operation, resulting in very high throughput. In-memory VMM engine designs based on the resistive random access memory (ReRAM) crossbar have shown promise of high energy efficiency, thanks to the zero-leakage storage, high density (1T cells), and nonvolatility of ReRAM devices [6]–[8]. However, recent studies have noted challenges associated with ReRAM's high write power, long read latency, and the difficulty of driving large crossbar arrays with many parallel ReRAM devices [11]. Moreover, because current summation is an analog computation, analog/digital conversion (ADC and DAC) is necessary, leading to high power/area overhead. As an alternative, some prior PIM-based designs utilize DRAM [9] or SRAM [10] to perform basic logic operations (such as NOR and AND) inside the memory rather than current summation, thereby avoiding ADC/DAC. However, such approaches suffer from large leakage current and reduced computing density due to their low-level logic abstraction.

This paper presents a ferroelectric FET (FeFET)-based PIM architecture for *in-memory* vector–matrix computation to accelerate DNN inference. Our design is built on three core concepts.

1) We employ FeFET as the basic memory cell. Compared with ReRAM, FeFET offers much lower read latency (smaller *RC* delay) and ultralow programming energy while keeping similar density and nonvolatility.
2) We exploit the gate-driven operation of FeFET to design an all-digital VMM engine, eliminating ADC/DAC while ensuring high throughput.
3) We present a scalable microarchitecture that connects multiple VMM engines using a hierarchical network-on-chip (H-NoC) with in-router accumulators, reducing the data transmission volume and latency.

FIGURE 1. FeFET structure and its ON/OFF state.

A chip-scale architecture is developed using our VMM-in-memory fabric coupled with specialized functional blocks and on-chip storage. A dedicated execution flow and software-hardware interface are developed to improve system flexibility. The proposed system is implemented in 28-nm CMOS technology. The area, power, and frequency of operation are simulated using postlayout analyses of the blocks and network. Cycle-level simulation is used for application-level performance analyses of different convolutional neural network (CNN) and recurrent neural network (RNN) models. Our design demonstrates 115× and 6.3× higher computing efficiency (GOPs/W) over a GPU (Nvidia GTX 1080Ti) and a ReRAM-based PIM design, respectively.

## II. FeFET BACKGROUND

Various memory technologies have been explored for PIM-based designs, including DRAM [9], eDRAM [2], [3], SRAM [10], and ReRAM [6], [7]. Among these, ReRAM has recently attracted considerable research attention and shown promise for very high energy efficiency. However, recent studies also note challenges associated with ReRAM's high write power, long read latency, and power/area overhead for driving large arrays [11].

To address the challenges presented by ReRAM, we propose to use FeFET as an alternative memory solution for the PIM architecture. A FeFET is a transistor in which a ferroelectric oxide layer is included in the gate dielectric stack, as shown in Fig. 1. A ferroelectric oxide is an insulator that exhibits a spontaneous electric polarization in the absence of an electric field. The direction of the polarization can be switched by applying a voltage larger than the coercive voltage to the gate terminal of the FeFET [12]. When the polarization points downward, the channel is in inversion, bringing the transistor into the "ON" state (i.e., the low $V_{\rm th}$ state). Similarly, if the polarization points upward, the channel is in accumulation, which puts the transistor in the "OFF" state (i.e., the high $V_{\rm th}$ state).



FIGURE 2. (a) Configuration of FeFET crossbar. (b) Layout view of a 256  $\times$  256 crossbar under 28-nm technology. (c) Measured  $I_{\rm ds}$ – $V_{\rm gs}$  characteristics from a FeFET device. Measurement data are extracted from [13] with the transistor size 30 nm  $\times$  80 nm. FeFET based 1-bit multiplication (i.e., AND logic).

It has already been demonstrated that hafnium oxide FeFETs have good temperature stability, write endurance, data retention, and switching speed/energy [12]–[14]. The ultralow write energy, due to the unique electric field effect switching mechanism, is the most prominent feature distinguishing FeFET from other emerging technologies. Further, unlike ReRAM, which presents a resistive load during reads, FeFET is gate-driven (i.e., a capacitive load), eliminating the large *RC* delay of the ReRAM case and reducing the read latency.

Besides utilizing FeFET as nonvolatile memory [12]–[14], a few recent works have explored FeFET-based logic (AND, OR, etc.) design [15], oscillator design [16], spiking neural networks [17], and binary neural network acceleration (using four FeFET cells for XNOR logic) [18]. These works focus on device/crossbar modeling and lack system/architecture-level design.

#### III. FeFET CROSSBAR DESIGN

With the FeFET crossbar as the core memory element, *each cell* in the crossbar performs a 1-bit multiplication. Fig. 2(a) shows the configuration of the FeFET crossbar, where the gate, drain, and source of each transistor are connected to the wordline (WL), bitline (BL), and source line (SL), respectively. Fig. 2(b) shows the corresponding layout view of a $256 \times 256$ crossbar in 28-nm technology (the layout is based on a normal MOSFET). The left side of Fig. 2(c) shows the measured $I_{\rm ds}$–$V_{\rm gs}$ characteristics [13]. Two distinct threshold voltages (-0.2 and 0.9 V) are observed, and the ON/OFF ratio is more than $10^5$. One should note that with our FeFET crossbar configuration, the devices can be programmed in a row-wise manner [19].

FIGURE 3. (a) Configuration 1: SA + counter based TDC VMM engine design. (b) Configuration 2: row-by-row read and accumulation based VMM engine design.

For computation, the weights are stored as transistor channel conductance (i.e., threshold voltage), and input vectors are used to drive WLs (i.e., transistor gate). We employ FeFET-based AND logic [15] to perform the 1-bit multiplication. One bit of weight is encoded as high $V_{\rm th}$ or low $V_{\rm th}$, representing either 0 or 1, respectively; similarly, 1 bit of the input vector can be encoded as high or low WL voltage ($V_{\rm gs}$). When the input bit is 0 (i.e., low $V_{\rm gs}$), the current is always 0 with either high $V_{\rm th}$ or low $V_{\rm th}$ since the transistor is turned off. On the other hand, if the input bit is 1 (i.e., high $V_{\rm gs}$), the transistor is still off when $V_{\rm th}$ is high but turns on when $V_{\rm th}$ is low. The large ON/OFF ratio of FeFET, thanks to its steep subthreshold slope (<60 mV/dec) [20], creates a large difference between the output "1" current and output "0" current.
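The AND behavior described above can be illustrated with a toy threshold model. This is only a sketch: the two threshold voltages are the values quoted for Fig. 2(c), while the wordline levels `V_WL_LOW` and `V_WL_HIGH` are hypothetical values chosen so that a low input sits below both thresholds and a high input sits between them. The source does not state the actual drive voltages.

```python
# Toy model of one FeFET cell acting as a 1-bit multiplier (AND gate).
V_TH_LOW = -0.2   # V, stores weight bit 1 (value quoted for Fig. 2(c))
V_TH_HIGH = 0.9   # V, stores weight bit 0 (value quoted for Fig. 2(c))
V_WL_LOW = -0.5   # V, input bit 0 (hypothetical level, below both thresholds)
V_WL_HIGH = 0.5   # V, input bit 1 (hypothetical level, between the thresholds)


def cell_conducts(input_bit: int, weight_bit: int) -> int:
    """Return 1 if the cell turns on, i.e. if V_gs exceeds the stored V_th."""
    v_gs = V_WL_HIGH if input_bit else V_WL_LOW
    v_th = V_TH_LOW if weight_bit else V_TH_HIGH
    return int(v_gs > v_th)


# Enumerate the truth table: the cell conducts only for input = weight = 1.
truth_table = {(a, b): cell_conducts(a, b) for a in (0, 1) for b in (0, 1)}
```

Under these assumed levels, the cell conducts only when both the input bit and the stored weight bit are 1, which is the 1-bit product.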

A key advantage of our design is that the WL connects to the transistor gate, which is a capacitive load. Therefore, there is no wordline voltage drop issue (*RC* delay) as in the ReRAM scenario. Moreover, the drain voltage is fixed at 1 V, which is the system supply voltage. One should note that the read disturb effect (a minor polarization loop caused by the read operation) is of less concern and is not explored in our design, since both the pulse width and the amplitude of the gate voltage are much smaller than the switching voltage reported in recent works [13], [21].

## IV. VMM ENGINE MICROARCHITECTURE

In this section, we present two configurations of the all-digital VMM engine to realize an ADC-free design.

### A. CONFIGURATION 1: SA AND COUNTER BASED TDC

We propose a VMM engine design based on a precharge/discharge approach, as shown in Fig. 3(a). First, the BL is precharged to the supply voltage $V_{\rm dd}$. Then, during computing, the BL voltage drops at a rate that depends on how many devices in the same column are turned on. The comparator (sense-amplifier based) samples the difference between the reference voltage $V_{\rm REF}$ and $V_{\rm BL}$ periodically (controlled by a clock signal clk). When clk is low, the output is 0 (reset). When clk is high, the output of the comparator is 1 if $V_{\rm BL} > V_{\rm REF}$, or 0 if $V_{\rm BL} < V_{\rm REF}$. Therefore, within one clock cycle, if $V_{\rm BL}$ is larger than the reference voltage, the comparator generates a pulse; if not, the output of the comparator remains 0. A counter accumulates the number of pulses from the comparator. In effect, with a simple sense amplifier (SA) and counter, we realize time-to-digital conversion (TDC). Simulation in 28-nm CMOS shows that our design consumes $2.7\times$ less power than an ADC [7] while achieving the same speed.
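The precharge/discharge readout can be sketched as a behavioral model. This is an illustration under simplifying assumptions, not the authors' circuit: each ON cell is assumed to pull the bitline down by a fixed step per clock cycle (linear discharge), and `dv_per_cell`, `v_ref`, and `max_cycles` are hypothetical parameters. Only the 1-V supply comes from the text.

```python
def tdc_count(num_on_cells: int,
              v_dd: float = 1.0,          # supply voltage from the text
              v_ref: float = 0.5,         # hypothetical reference voltage
              dv_per_cell: float = 0.002, # hypothetical drop per ON cell per cycle
              max_cycles: int = 256) -> int:
    """Count comparator pulses while the bitline voltage stays above v_ref."""
    pulses = 0
    for cycle in range(1, max_cycles + 1):
        # Assumed linear discharge: more ON cells means a faster drop.
        v_bl = v_dd - num_on_cells * dv_per_cell * cycle
        if v_bl > v_ref:
            pulses += 1   # comparator outputs 1 in this clock cycle
        else:
            break         # bitline fell below the reference; no more pulses
    return pulses


# Pulse counts for 0 to 8 conducting cells in one column.
counts = [tdc_count(k) for k in range(9)]
```

In this model the pulse count falls as more cells conduct, so the counter value encodes the number of ON cells in the column, which is the quantity the MAC operation needs.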

### B. CONFIGURATION 2: ROW-BY-ROW READ AND ACCUMULATION

The ADC (or TDC in configuration 1) is inevitable if multiple WLs are simultaneously activated. To eliminate it, we propose to activate a single WL per clock cycle. As shown in Fig. 3(b), the first part of the WL peripheral is a clock-driven one-hot vector unit. In the first clock cycle, the enable signal for the top WL (en<sub>1</sub>) is turned on. Then, in the second clock cycle, en<sub>1</sub> is disabled and en<sub>2</sub> is enabled, and so on. The second part of the WL peripheral is a logic gate, which performs an AND operation between the enable signal (en<sub>i</sub>) and the input vector ($a_i$). Only when both en<sub>i</sub> and $a_i$ are high is the value stored in the corresponding memory cell $b_i$ sensed out. At the BL peripheral, an SA senses out the value (either 0 or 1) and sends it to the counter in each clock cycle. Essentially, the MAC operation is performed in N cycles, where N is the number of rows in the memory array.
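The row-by-row scheme can be sketched for a single column as follows. The function and variable names are illustrative; the structure (one-hot enable, AND with the input bit, sense, count) follows the description above.

```python
def row_by_row_mac(a: list[int], b: list[int]) -> int:
    """1-bit MAC for one column: a holds input bits, b holds stored weight bits."""
    if len(a) != len(b):
        raise ValueError("input vector and column must have the same length")
    counter = 0
    for cycle in range(len(a)):
        # One-hot enable: only the row matching the current cycle is active.
        enable = [int(i == cycle) for i in range(len(a))]
        # WL peripheral: AND between the enable signal and the input bit.
        wordline = [en & ai for en, ai in zip(enable, a)]
        # At most one cell can conduct, so the SA reads either 0 or 1.
        sensed = max(wl & bi for wl, bi in zip(wordline, b))
        counter += sensed
    return counter


result = row_by_row_mac([1, 0, 1, 1], [1, 1, 0, 1])  # 1*1 + 0*1 + 1*0 + 1*1
```

The loop runs once per row, which is why the MAC latency equals the number of rows N.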

Although it sacrifices the parallelism of analog computing, our design is faster than the ADC-based approach. This is because, in prior ADC-based designs, the speed of the ADC is the major throughput bottleneck. For example, in ISAAC [7], a 1.3 giga-samples-per-second (GS/s) SAR ADC is shared by a memory crossbar, and it takes 100 ns to convert the analog values for a 128 × 128 memory array (the crossbar size in ISAAC). In our design, it takes 128 clock cycles to perform the same computation (looping through all the rows). For a 128 × 128 FeFET array (2 KB), our simulation indicates that the internal clock frequency can go up to 4 GHz (reading frequency, not programming) in 28-nm technology, resulting in 25.6 ns to perform the MAC operation, ~4× faster than the ADC-based solution.

### C. COMPARISON OF THE TWO CONFIGURATIONS

While the two proposed configurations have very different reading/sensing schemes, the circuit-level implementations are similar. To be more specific, the sensing is realized by an SA (called the comparator in configuration 1) and a counter. In essence, with the same system implementation, we can realize these two configurations simply by changing the control signal (i.e., the WL activation) patterns.

FIGURE 4. Matrix partition and mapping to multiple VMM engines. Different colors and shades are used to help track the input and weight mapping.

In terms of speed, the latency for configuration 1 is determined by the clock frequency of the comparator and counter (the reference voltage $V_{\rm REF}$ needs to change accordingly). For a FeFET crossbar with N rows, the counter needs to wait for at least N clock cycles to ensure the output has N levels. Similarly, the latency for configuration 2 is determined by the memory clock, as the sensing is performed row-by-row. For an N-row crossbar, N clock cycles are necessary to get the final output.

In terms of accuracy, configuration 2 is preferred. In configuration 1, even though the sensing is digital, the BL discharge is still in the analog domain. Further, the device variation, cell leakage, nonlinearity of BL discharge, and temperature/voltage fluctuation can introduce a computing error. On the other hand, the MAC in configuration 2 is in the digital domain, providing much better fidelity in terms of computing accuracy.

#### V. DATA COMMUNICATION NETWORK

#### A. VMM ENGINE FOR LARGE SCALE MATRIX OPERATION

Fig. 4 illustrates the methodology to partition a large matrix-matrix multiplication across multiple VMM engines. Assuming that one VMM engine can hold parameters of size $s \times s$, the weight matrix is partitioned into several small segments with the granularity of $s \times s$. In total, $n \times m$ VMM engines are used (Fig. 4). Similarly, the input matrix is first transposed, partitioned, and sequentially fed into the corresponding VMM engines.

From Fig. 4, we observe that each input segment is shared across multiple VMM engines horizontally (e.g., VMM11, VMM12, through VMM1m). We call this *row-wise input sharing*.



FIGURE 5. (a) H-NoC. (b) Router design with accumulator integrated. (c) Three different data forwarding patterns and corresponding addresses, including one-to-one forwarding and broadcasting.

On the other hand, partial results generated from the same column of multiple VMM engines should be summed together vertically (e.g., VMM11, VMM21, through VMMn1 in Fig. 4) since they belong to the same column in the original weight matrix. We call this *column-wise output summation*.
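The partitioning, row-wise input sharing, and column-wise output summation described above can be sketched in plain Python. This is a simplified illustration: it uses a single input vector and integer weights, and assumes the matrix dimensions are multiples of the tile size `s`. The helper names are hypothetical.

```python
def matvec(vec, mat):
    """Row vector times matrix, standing in for one VMM engine."""
    return [sum(vec[r] * mat[r][c] for r in range(len(mat)))
            for c in range(len(mat[0]))]


def tiled_vmm(vec, weights, s):
    """Compute vec x weights using an n x m grid of s x s VMM engines."""
    rows, cols = len(weights), len(weights[0])
    if rows % s or cols % s:
        raise ValueError("dimensions must be multiples of the tile size")
    n, m = rows // s, cols // s
    out = [0] * cols
    for i in range(n):
        # Row-wise input sharing: one segment feeds VMM_i1 ... VMM_im.
        segment = vec[i * s:(i + 1) * s]
        for j in range(m):
            tile = [row[j * s:(j + 1) * s] for row in weights[i * s:(i + 1) * s]]
            partial = matvec(segment, tile)
            # Column-wise output summation across VMM_1j ... VMM_nj.
            for k, value in enumerate(partial):
                out[j * s + k] += value
    return out


weights = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
vec = [1, 0, 2, 1]
tiled = tiled_vmm(vec, weights, s=2)
```

The tiled result matches the direct product, which is the property the mapping relies on.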

#### B. HIERARCHICAL NETWORK-ON-CHIP DESIGN

We propose an H-NoC to address the discrepancy between row-wise input sharing and column-wise output summation (Fig. 5). At the bottom level, four VMM engines share a router. Then, four such routers are connected to a router at a higher level. Since the VMM engines are organized in a hierarchical fashion, a system with N levels of routers can accommodate up to $4^N$ VMM engines. Fig. 5(b) shows the router design, containing five input-output ports and corresponding I/O buffers. A $5 \times 5$ switching matrix connects the input-output ports, and routing is based on the store-and-forward (SAF) approach. Unlike conventional router designs, we insert a computing block (i.e., an accumulator) inside the router to enable on-the-fly partial result summation. The benefits of H-NoC are twofold.

FIGURE 6. Case study to illustrate how data are mapped via H-NoC. We use different colors and shades to help track the input and weight mapping. (a) Input and weight data organization in vector and matrix form. (b) Input and weight data map to the VMM engine.

First, H-NoC realizes efficient row-wise input sharing. Fig. 5(c) illustrates three different data forwarding patterns. The first example shows one-to-one forwarding. The top-level router decodes the first 4 bits of a packet (each bit represents the on/off state of the top-left, top-right, bottom-right, and bottom-left output ports, e.g., "1000" means the packet goes to the top-left branch) and sends the packet to its sublevel router. The sublevel router then decodes the next 4 bits, and this repeats until the packet arrives at the designated VMM engine at the top-left corner. Besides one-to-one forwarding, a packet can be broadcast. As shown in the last example of Fig. 5(c), since the first 4-bit address is "1111," the top-level router broadcasts the packet to its sublevel routers in all four directions. This process repeats, and finally, a single packet is delivered to 16 distributed VMM engines simultaneously.
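The address decoding can be sketched as follows. The port order (top-left, top-right, bottom-right, bottom-left) follows the text. Representing each 4-bit field as a string and a destination as a tuple of port names is an illustrative choice, not the packet format of the actual design.

```python
PORTS = ("top-left", "top-right", "bottom-right", "bottom-left")


def route(address_fields: list[str]) -> list[tuple[str, ...]]:
    """Return the path to every leaf reached by a packet.

    Each entry in address_fields is the 4-bit mask decoded at one router level;
    a '1' forwards the packet through the corresponding output port.
    """
    paths: list[tuple[str, ...]] = [()]
    for field in address_fields:
        if len(field) != 4 or set(field) - {"0", "1"}:
            raise ValueError(f"invalid 4-bit field: {field!r}")
        # Every router reached at this level decodes the same field.
        paths = [path + (port,)
                 for path in paths
                 for port, bit in zip(PORTS, field)
                 if bit == "1"]
    return paths


one_to_one = route(["1000", "1000"])   # a single top-left leaf
broadcast = route(["1111", "1111"])    # all 16 leaves of a two-level tree
shared = route(["1100", "1000"])       # two leaves, as in the I_4 case study
```

A single packet therefore reaches 1, 2, or 16 engines depending only on its address bits, without being re-sent.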

A case study is used to illustrate how row-wise input sharing benefits from input broadcasting. As shown in Fig. 6(a), a large weight matrix is first partitioned into several segments (we show $2 \times 8 = 16$ segments; more details about matrix partition and mapping are discussed in the supplementary materials). Then, we map $W_{11}$, $W_{21}$, $W_{31}$, and $W_{41}$ to four VMM engines sharing the same router node **4**. Then, input vectors are sent to the corresponding VMM engines (input sharing and reuse). For example, $I_4$ (in blue) should go to the two VMM engines that store $W_{41}$ and $W_{42}$. Conventionally, this requires two packets and two cycles since there are two destination VMM engines. With H-NoC, this can be done with a single packet and one cycle. As shown in Fig. 6(b), router **1** decodes the first 4-bit address (1100) and then broadcasts $I_4$ to its sublevel routers in the top-left and top-right directions (i.e., sends the packet to routers **2** and **3**). These two routers then decode the next 4 bits (1000) and send the packet to the top-left routers **4** and **6**. Finally, the packet goes to the bottom-right leaf VMM engines of routers 2 and 3.

Second, *H-NoC* enables efficient column-wise partial result summation. With the in-router accumulator, the summation is performed on-the-fly, i.e., output summation happens during data transmission. Again, we use the case in Fig. 6 as an example. It takes two steps to obtain the summation $S_{\text{total}} = \sum_{i=1}^{8} I_i \cdot W_{i1}$. First, the two bottom-level routers attached to the VMM engines storing $W_{11}$ through $W_{81}$ work independently and in parallel, each receiving four partial results from its connected VMM engines and summing them with the built-in accumulator (i.e., $S_{\text{partial},1} = \sum_{i=1}^{4} I_i \cdot W_{i1}$ and $S_{\text{partial},2} = \sum_{i=5}^{8} I_i \cdot W_{i1}$). The router one level above then accumulates these two partial results and sends the final summation to the global buffer. Therefore, rather than sending each partial result to the global buffer as a separate packet, only one packet is sent to the global buffer, leveraging the on-the-fly/parallel processing enabled by H-NoC.

Depending on how many VMM engines are involved in one matrix computation, this process repeats until all the partial results are summed together. As routers in the same level work in parallel, the worst-case latency is limited to $4 \times$ the number of router levels, since it takes four clock cycles for a router to accumulate partial results from its four branches.
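The in-router accumulation and its latency bound can be sketched as a tree reduction. This is a simplified model: it assumes every router needs four cycles regardless of how many branches are active, and it ignores link and buffer delays. The function name is hypothetical.

```python
def hnoc_reduce(partials: list[int], fan_in: int = 4, cycles_per_router: int = 4):
    """Sum partial results level by level, as in-router accumulators would.

    Returns (total, router_levels, worst_case_cycles).
    """
    if not partials:
        raise ValueError("need at least one partial result")
    values = list(partials)
    levels = 0
    while len(values) > 1:
        # Routers on the same level work in parallel; each sums up to fan_in inputs.
        values = [sum(values[i:i + fan_in]) for i in range(0, len(values), fan_in)]
        levels += 1
    return values[0], levels, levels * cycles_per_router


# Eight partial results, as in the Fig. 6 case study: two router levels.
total, levels, cycles = hnoc_reduce([1, 2, 3, 4, 5, 6, 7, 8])
```

Only the final value leaves the tree, so the global buffer receives one packet instead of eight.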

#### VI. CHIP-SCALE ARCHITECTURE

#### A. SYSTEM ARCHITECTURE

Fig. 7(a) shows the system architecture with VMM engines interconnected by the H-NoC. We implement several fixed-function units to support computation that cannot be accelerated within the VMM engines, such as element-wise multiplication and activation functions. This gives our design the flexibility to support various CNN and RNN models. For example, the multiplier array and adder array are used for element-wise multiplication and addition, respectively. Combined, they can also be used to compute the Taylor series of some special functions, such as sigmoid. Since our design accelerates DNN inference, both the adders and the multipliers have 16-bit fixed-point precision. A pooling processor performs average or max pooling, and a ReLU unit implements the ReLU layer. There are several other fixed-function units, such as max value search and a divider (used in the batch normalization layer). The last component in the system architecture is the microprocessor, which fetches/decodes the instructions and coordinates data access and transmission.

#### B. EXECUTION FLOW

The first layer of AlexNet is used as an example to illustrate the execution flow [Fig. 7(b)]. The input feature map to this layer is $224 \times 224 \times 3 \times 1$, representing the image width, height, RGB channels, and mini-batch size, respectively. The convolutional kernel size is $3 \times 11 \times 11 \times 96$, corresponding to the number of input channels, kernel width/height, and output channels, respectively. The microprocessor calculates how many memory subarrays are required to perform the computation. Assuming the memory subarray size is 256 × 256, each convolution kernel (i.e., $3 \times 11 \times 11$) contains 363 weight parameters, and thus $\lceil 363/256 \rceil = 2$ crossbar arrays are needed to perform the dot product for one kernel. Further, kernels can be placed side by side horizontally to achieve parallelism. Since there are 96 kernels, in total, we need $(96 \times 8 \div 256) \times 2 = 6$ memory subarrays, given that each element is an 8-bit number (256 devices in a row can hold 32 8-bit numbers). Then, the microprocessor wraps the received data into discrete packets (the first few bits of a packet contain the routing address) and dispatches them to the target memory subarray locations via H-NoC. After the computing is done, the results are collected and sent to the activation/pooling function units through the system bus.
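The subarray count above can be reproduced with a short calculation. The function name and argument layout are illustrative. The sketch assumes that one kernel's weights are stacked along the rows of a crossbar and that different kernels sit side by side along its columns, as the text describes.

```python
import math


def subarrays_for_conv_layer(in_channels: int, k_w: int, k_h: int,
                             out_channels: int, array_size: int = 256,
                             weight_bits: int = 8) -> int:
    """Estimate how many array_size x array_size crossbars one conv layer needs."""
    weights_per_kernel = in_channels * k_w * k_h
    # Crossbars stacked vertically to hold one kernel's weights along the rows.
    arrays_per_kernel = math.ceil(weights_per_kernel / array_size)
    # Each row holds array_size / weight_bits multi-bit weights, one per kernel.
    kernels_per_array = array_size // weight_bits
    array_columns = math.ceil(out_channels / kernels_per_array)
    return arrays_per_kernel * array_columns


# First AlexNet layer: 3 x 11 x 11 kernels, 96 output channels, 8-bit weights.
alexnet_conv1 = subarrays_for_conv_layer(3, 11, 11, 96)
```

For the first AlexNet layer this gives 2 arrays per kernel and 3 array columns, i.e., 6 subarrays, matching the text.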

Eventually, the output feature maps are generated and stored in the memory for temporary storage since they will be used as inputs for the next layer. Preferably, these data are stored in on-chip memory if there is available space; alternatively, if all the on-chip memory subarrays are programmed with weights, the microprocessor will offload the temporary data off-chip. Typically, a single layer parameter size is less than 2 MB, and the offload of temporary data rarely happens for our benchmark CNNs and RNNs.

FIGURE 7. Chip-scale architecture of the FeFET-based PIM design. (a) System architecture. (b) and (c) Execution flow.

#### C. SOFTWARE/HARDWARE INTERFACE

A software/hardware interface is designed to bridge the gap between the software and hardware, letting users easily deploy their applications without specific hardware knowledge. As shown in Fig. 7(c), the runtime system takes a DNN definition file (as well as a pretrained model if available) as input, sets the computing model and running precision, and performs layer-wise interpretation to translate the high-level DNN model definition into the instructions that we developed for the proposed system. There are three types of instructions: control, layer, and parameter specification. Instructions are 64 bits long, with the first 6 bits as the opcode. Control instructions are used to define computing precision, set the running mode (only inference is available now; supporting training is our future work), and write the address register. Layer instructions define the layer and where the weights and activations should be fetched from. Parameter specification instructions are always attached to layer instructions, specifying more information about the layer defined by the previous instruction. For example, to define the computation of a convolutional layer, we need two instructions: one for layer specification, which defines the layer type and where the weights and inputs should be read from (i.e., weight and input addresses), and one that defines the convolutional kernel size, input/output feature map depths, etc. Details about the instruction design and more examples are available as supplementary materials.
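A 64-bit instruction word with a 6-bit opcode can be packed and unpacked as in the sketch below. Only the word width and opcode width come from the text. Placing the opcode in the most significant bits and treating the remaining 58 bits as one opaque payload are assumptions made for illustration, and the example opcode value is hypothetical. The actual field layout is in the paper's supplementary materials.

```python
WORD_BITS = 64
OPCODE_BITS = 6
PAYLOAD_BITS = WORD_BITS - OPCODE_BITS  # 58 bits, layout not specified here


def encode(opcode: int, payload: int) -> int:
    """Pack an opcode and payload into one 64-bit word (assumed MSB-first opcode)."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in 6 bits")
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload does not fit in 58 bits")
    return (opcode << PAYLOAD_BITS) | payload


def decode(word: int) -> tuple[int, int]:
    """Split a 64-bit word back into (opcode, payload)."""
    return word >> PAYLOAD_BITS, word & ((1 << PAYLOAD_BITS) - 1)


# Hypothetical opcode 0b000101 with an arbitrary payload.
word = encode(0b000101, 0x1234)
fields = decode(word)
```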

#### **VII. SIMULATION RESULTS**

#### A. PROTOTYPE DESIGN

The prototype design contains 2048 VMM engines organized with a six-level H-NoC. Each VMM engine contains a 256 $\times$ 256 FeFET memory crossbar together with the peripheral circuitry. In addition, at the system level, we implemented several functional blocks, such as multiplier/adder arrays, a pooling processor, and ReLU units. We performed SPICE simulation with 28-nm CMOS technology (a normal MOSFET model with calibrated threshold voltage and transistor size to mimic the I-V characteristics of real FeFET measurement data [13]) using the extracted netlist of the crossbar together with the WL drivers and SAs to estimate the power and latency of the memory subarray. The SPICE simulation is then coupled with synthesized digital blocks (such as counters, H-NoC, functional blocks, and controller) to form a complete chip-level model. While the system clock is set to 1 GHz, the FeFET memory subarray can run at a higher clock frequency. Therefore, for configuration 1 (SA + counter based TDC approach), we set the clock to 2 GHz. Similarly, for configuration 2 (row-by-row read and accumulation approach), the memory crossbar also has an internal 2-GHz clock. The off-chip memory bandwidth is set to 512 GB/s, the same as TPU-v2 [22]. The key design specification and the layout view for a VMM engine are presented in Fig. 8. One should note that we use the same circuit implementation and change only the WL activation pattern (parallel versus row-by-row access) to realize the two configurations. Also, we ignore the power overhead of the reference voltage generation circuit in configuration 1 since it can be shared across the system. Therefore, the power of the VMM engine for these two configurations is also similar.

Design specification for the prototype implementation

| Configurations | Config 1 (SA + counter based TDC) | Config 2 (row-by-row accumulation) |
|----------------|-----------------------------------|------------------------------------|
| Computing domain | Mixed signal | Digital |
| # of VMM engines (memory capacity) | 2048 256 × 256 FeFET crossbars (16 MB) | 2048 256 × 256 FeFET crossbars (16 MB) |
| Peak throughput | 16.38 TOPs (@ clk_mem: 2 GHz with 8-bit precision) | 16.38 TOPs (@ clk_mem: 2 GHz with 8-bit precision) |
| DRAM bandwidth | 512 GB/s | 512 GB/s |
| VMM power \* | 5.8 mW | 5.8 mW |
| System power | 18.2 W | 18.2 W |
| Technology | 28 nm CMOS | 28 nm CMOS |

\* For config 1, we ignore the power for reference voltage generation.

FIGURE 8. Design specification for the prototype implementation and the layout view for one VMM engine. Two configurations share the same circuit implementation.
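The memory capacity and the number of H-NoC levels in the prototype follow directly from the stated engine count and crossbar size. The peak-throughput line below is one possible accounting that lands on the reported figure. It assumes each crossbar produces one 1-bit product per column per memory clock cycle, and that an 8-bit by 8-bit multiply-accumulate costs 8 × 8 such 1-bit products and counts as one operation. This accounting is an assumption and is not spelled out in the source.

```python
ENGINES = 2048
ROWS = COLS = 256
MEM_CLOCK_HZ = 2e9          # internal memory clock from the text
WEIGHT_BITS = ACT_BITS = 8  # precision used for the peak-throughput figure

# One bit per FeFET cell: total on-chip weight storage in MB.
capacity_mb = ENGINES * ROWS * COLS / 8 / 2**20

# Smallest number of router levels N such that 4**N >= number of engines.
levels = 0
while 4 ** levels < ENGINES:
    levels += 1

# Assumed accounting: COLS 1-bit products per engine per memory clock cycle.
bit_products_per_second = ENGINES * COLS * MEM_CLOCK_HZ
peak_tops = bit_products_per_second / (WEIGHT_BITS * ACT_BITS) / 1e12
```

Under these assumptions the capacity is 16 MB, six router levels are needed (five levels would cover only 1024 engines), and the peak throughput comes out near 16.38 TOPs.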



FIGURE 9. Peak throughput with different weight and activation precision. 32/16 means the weight is 32 bit while the activation is 16 bit.

#### **B. BENCHMARKS AND PRECISION**

We evaluate five DNN models (AlexNet, GoogleNet, VGG-16, VGG-19, and LSTM) with varying parameter sizes and computing complexity (i.e., GOPs).

We also explore the system peak performance with different bitwidths of DNN weights and activations. As illustrated in Fig. 9, lower bit precision yields higher throughput. We also note that when the activation precision is below 8 bits (the precision of a 256-row FeFET crossbar), configuration 1 becomes faster, as it takes fewer clock cycles to accumulate the sensing result (a more detailed analysis of this observation can be found in the Supplementary Material).

In the following experiments, we assume that weights and activations have 8-bit precision because we observe that, for the benchmark DNN models, 8 bits are sufficient to preserve the inference accuracy. One should note that state-of-the-art DNN models (such as ResNet [23] and MobileNet [24]) typically require 16-bit or even floating-point precision for the best accuracy. Supporting flexible bit precision and floating-point operation is our future work.

#### C. PERFORMANCE ANALYSES

First, for data transmission efficiency, we compare our H-NoC design with the naive approach (no input broadcasting/reuse or output on-the-fly processing) and ISAAC-like design (using the two-stage hierarchical buffer for output accumulation) [7]. Fig. 10(a) shows the data (input, weights, and internal temporary data) transmission latency for processing one image using four different benchmark CNNs. On average, our design reduces the latency by 14.5× and 6.7× over the naive approach and ISAAC-like design across the benchmark CNNs, respectively.

Second, we analyze the power efficiency of the FeFET VMM engines and compare it with a ReRAM baseline design, as illustrated in Fig. 10(b). We first consider using an ADC in the BL peripherals and inserting a buffer to drive the WL (i.e., a resistive load). With a simple technology replacement from ReRAM to FeFET (using the same peripherals), we observe that the FeFET-based design achieves only $1.2\times$ power reduction because the power consumption of the peripherals dominates. With the optimized digital-like peripherals (i.e., replacing the power-hungry ADC with SA + counter and eliminating the WL buffer since FeFET is a capacitive load), a significant power efficiency improvement is observed (another $5.7\times$). In total, with the cross-cutting solutions combining emerging device technologies and circuit innovations, the FeFET-based VMM engine demonstrates $6.3\times$ power efficiency over the baseline ReRAM design.

FIGURE 10. System performance improvements for (a) H-NoC for data transmission and (b) VMM engine for computation.

FIGURE 11. Normalized inference speed of desktop GPU and our design for DNN models with varying batch sizes. For VGG-16 and VGG-19, we choose smaller batch sizes to avoid the *run out of memory* error.

We then evaluate the overall system performance using our benchmark DNN models and compare it with data measured on a desktop GPU (Nvidia GTX 1080Ti with 11.3 TFLOPs throughput and 250-W power). Fig. 11 shows the normalized speed comparison for different DNNs under varying batch sizes. We do not differentiate the two VMM engine configurations because they have similar throughput when using 8-bit precision. We observe that our design outperforms the GPU solution by $8.4\times$ in terms of frames per second (fps). Additionally, the desktop GPU's power is $13.7\times$ higher than that of our design, resulting in up to $115\times$ computing efficiency (GOPs/W) improvement with our design.
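The relationship between the quoted speedup, power ratio, and efficiency gain can be checked with a few lines of arithmetic. The sketch below assumes that the $8.4\times$ speedup and the $13.7\times$ power ratio refer to the same operating point, which the text does not state explicitly; the variable names are illustrative only.

```python
# Figures quoted in the text; variable names are illustrative only.
speedup_fps = 8.4        # our design vs. GPU, frames per second
gpu_power_ratio = 13.7   # GPU power divided by the power of our design
gpu_power_w = 250.0      # GTX 1080Ti power quoted in the text

# Efficiency (work per unit energy) scales as throughput / power, so the
# gain is the product of the speedup and the power ratio.
efficiency_gain = speedup_fps * gpu_power_ratio

# Power implied for our design by the quoted ratio (assumption: the ratio
# is taken against the 250-W figure).
implied_power_w = gpu_power_w / gpu_power_ratio

print(f"efficiency gain ~ {efficiency_gain:.0f}x")
print(f"implied power   ~ {implied_power_w:.1f} W")
```

Under these assumptions the product rounds to the quoted 115×, and the implied power is close to the 18.2 W listed for our design in Table 1.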

#### D. COMPUTING ACCURACY

The device variation of FeFET can potentially impact the computing accuracy. Similar to prior ReRAM-based designs, we use Gaussian noise to represent the stochastic device variation [25]. We calibrate our device variation model with experimental FeFET data in recently published works [13], [14]. The typical variation (the standard deviation $\sigma$) varies from 1% to 20%.

|                   | Technology | Hardware | Parameter storage | Power (W) | Area (mm²) | Array-level efficiency | System-level efficiency | Peak throughput |
|-------------------|------------|----------|-------------------|-----------|------------|------------------------|-------------------------|-----------------|
| DaDianNao [2]     | 28 nm      | ASIC     | eDRAM (on-chip)   | 20.1      | 67.7       |                        | 286 GOPs/W              | 5.7 TOPs        |
| TPU-v2 [22]       |            | ASIC     | DRAM              | ~250      |            |                        | 180 GOPs/W              | 45 TOPs         |
| DeepTrain [5]     | 15 nm      | NMP      | DRAM              | 7.2       |            |                        | 566 GOPs/W              | 7.2 TOPs        |
| ISAAC [7]         | 28 nm      | PIM      | ReRAM             | 65.8      | 85.4       | 604 GOPs/W             | 381 GOPs/W              | 25.1 TOPs       |
| Neural Cache [10] | 28 nm      | PIM      | SRAM              | 52.9      |            | 529 GOPs/W             |                         | 28 TOPs         |
| Analog-FeFET [26] |            | PIM      | FeFET             |           |            | 840 GOPs/W             |                         |                 |
| Our work          | 28 nm      | PIM      | FeFET             | 18.2      | 49.6       | 1234 GOPs/W            | 896 GOPs/W              | 16.38 TOPs      |

TABLE 1. Performance comparison with other DNN accelerators.

FIGURE 12. Computing accuracy under different levels of device variation. The variation is characterized with Gaussian noise.

Fig. 12 shows the classification accuracy deterioration under device variation, considering the two proposed VMM engine configurations. As mentioned earlier, the first configuration eliminates the ADC but still performs part of the computing in the mixed-signal domain; thus, it is more vulnerable to device variation. On the other hand, thanks to the large ON/OFF ratio of FeFET, the second configuration demonstrates good robustness to device noise.
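A toy model can illustrate why a large ON/OFF ratio helps when variation is modeled as Gaussian noise. The sketch below is not the authors' simulation: the multiplicative noise form, the normalized ON/OFF currents (`i_on`, `i_off`), the mid-point threshold, and all function names are assumptions chosen for illustration. It contrasts per-cell thresholded sensing with a read-out that depends on the summed current of many cells.

```python
import random

def apply_variation(currents, sigma, rng):
    """Scale each cell current by (1 + N(0, sigma)); sigma is relative (0.05 = 5%)."""
    return [i * (1.0 + rng.gauss(0.0, sigma)) for i in currents]

def count_bit_errors(bits, sigma, rng, i_on=1.0, i_off=1e-4):
    """Per-cell sensing: compare each noisy current with a mid-point threshold."""
    ideal = [i_on if b else i_off for b in bits]
    noisy = apply_variation(ideal, sigma, rng)
    threshold = 0.5 * (i_on + i_off)
    sensed = [1 if i > threshold else 0 for i in noisy]
    return sum(s != b for s, b in zip(sensed, bits))

def summed_current_error(bits, sigma, rng, i_on=1.0, i_off=1e-4):
    """Summed read-out: relative error of the total current of all cells."""
    ideal = [i_on if b else i_off for b in bits]
    noisy = apply_variation(ideal, sigma, rng)
    return abs(sum(noisy) - sum(ideal)) / sum(ideal)

rng = random.Random(0)                      # fixed seed for repeatability
bits = [rng.randint(0, 1) for _ in range(256)]
for sigma in (0.01, 0.05, 0.10, 0.20):      # the 1% to 20% range from the text
    errors = count_bit_errors(bits, sigma, rng)
    drift = summed_current_error(bits, sigma, rng)
    print(f"sigma={sigma:.2f}  bit errors={errors}  summed-current error={drift:.4f}")
```

In this toy setting, a thresholded ON cell flips only if its current deviates by roughly half its nominal value, whereas any deviation perturbs a summed current directly. How closely this mirrors the two VMM engine configurations depends on circuit details not modeled here.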

#### E. COMPARISON WITH OTHER WORKS

We perform a detailed comparison with existing DNN accelerators implemented as application-specific integrated circuits (ASICs) [2], [22], NMP architectures [4], [5], and PIM architectures [7], [10]. For the ASIC-based solutions, we consider DaDianNao [2], which integrates a large amount of on-chip eDRAM to store DNN parameters, and TPU-v2 [22], the second generation of the tensor processing unit from Google. For NMP, we investigate DeepTrain [5], an architecture that integrates a logic layer into high-bandwidth DRAM. We also compare with ISAAC [7], a pioneering ReRAM-based PIM architecture for DNN acceleration. For SRAM-based design, we evaluate a recent work, Neural Cache [10], a bit-serial logic-in-memory-based DNN accelerator architecture. Finally, we compare with a recent FeFET-based design [26], which utilizes FeFET as an analog synapse (each device stores 5 bits) for DNN training acceleration. The key design features are summarized in Table 1. In terms of computing efficiency, we evaluate two different aspects, namely, array-level and system-level. For array-level efficiency, only the energy consumed by the array (device and crossbar peripherals) is considered.

In conclusion, our design achieves state-of-the-art performance with 896 GOPs/W computing efficiency using 8-bit precision. Our design outperforms other PIM architectures by leveraging the following merits.

- 1) The key horsepower comes from the VMM engine, which eliminates the slow/power-hungry ADC, plus the high memory clock frequency.
- 2) H-NoC helps to reduce the data movement latency for both input reuse and output collection.
- 3) With FeFET as the memory cell, we also benefit from the dense cell structure and low read latency/write energy.
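As a rough cross-check of the system-level figure, one can divide the peak throughput by the power listed in Table 1. This assumes that system-level efficiency is approximately peak throughput over total power, a definition the text does not spell out; the variable names are illustrative only.

```python
# Values copied from the "Our work" and ISAAC rows of Table 1.
peak_tops = 16.38             # peak throughput of our design, TOPs
power_w = 18.2                # power of our design, W
reported_gops_per_w = 896.0   # reported system-level efficiency, our design
isaac_gops_per_w = 381.0      # reported system-level efficiency, ISAAC

# 1 TOPs = 1000 GOPs, so efficiency in GOPs/W is 1000 * TOPs / W.
estimated = 1000.0 * peak_tops / power_w

# Relative gap between the estimate and the reported value.
gap = abs(estimated - reported_gops_per_w) / reported_gops_per_w

# Ratio of reported system-level efficiencies, our design vs. ISAAC.
gain_over_isaac = reported_gops_per_w / isaac_gops_per_w

print(f"estimated {estimated:.0f} GOPs/W, gap {gap:.2%}, {gain_over_isaac:.2f}x over ISAAC")
```

The estimate lands within about half a percent of the reported 896 GOPs/W, so the two numbers are consistent under this assumed definition.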

### VIII. CONCLUSION

In this work, we propose an FeFET-based PIM architecture to accelerate DNN inference. With FeFET as the basic memory cell and an ADC-free VMM engine design, the computing efficiency is significantly enhanced. A dedicated H-NoC is developed to realize fast and parallel data communication. As FeFET continues to mature toward a commercial technology, we show a pathway to a highly efficient architecture that leverages the unique properties of this technology to accelerate challenging data-intensive computing applications.

#### **REFERENCES**

- [1] S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," in *Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA)*, Jun. 2016, pp. 243–254.
- [2] Y. Chen et al., "DaDianNao: A machine-learning supercomputer," in *Proc. IEEE/ACM 47th Annu. Int. Symp. Microarchitecture*, 2014, pp. 609–622.
- [3] Z. Du et al., "ShiDianNao: Shifting vision processing closer to the sensor," in *Proc. 42nd Annu. Int. Symp. Comput. Archit.*, 2015, vol. 43, no. 3, pp. 92–104.
- [4] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory," in *Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA)*, Jun. 2016, pp. 380–392.
- [5] D. Kim, T. Na, S. Yalamanchili, and S. Mukhopadhyay, "DeepTrain: A programmable embedded platform for training deep neural networks," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 37, no. 11, pp. 2360–2370, Nov. 2018.
- [6] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," *ACM SIGARCH Comput. Archit. News*, vol. 44, no. 3, pp. 27–39, 2016.
- [7] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," *ACM SIGARCH Comput. Archit. News*, vol. 44, no. 3, pp. 14–26, 2016.
- [8] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in *Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA)*, Feb. 2017, pp. 541–552.
- [9] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, "DRISA: A DRAM-based reconfigurable in-situ accelerator," in *Proc. IEEE/ACM 50th Annu. Int. Symp. Microarchitecture*, 2017, pp. 288–301.
- [10] C. Eckert et al., "Neural cache: Bit-serial in-cache acceleration of deep neural networks," in *Proc. 45th Annu. Int. Symp. Comput. Archit.*, 2018, pp. 383–396.
- [11] Y. Long et al., "A ferroelectric FET based power-efficient architecture for data-intensive computing," in *Proc. Int. Conf. Comput.-Aided Design*, 2018, Art. no. 32.
- [12] J. Muller, T. S. Boscke, U. Schroder, R. Hoffmann, T. Mikolajick, and L. Frey, "Nanosecond polarization switching and long retention in a novel MFIS-FET based on ferroelectric HfO<sub>2</sub>," *IEEE Electron Device Lett.*, vol. 33, no. 2, pp. 185–187, Feb. 2012.
- [13] M. Trentzsch et al., "A 28 nm HKMG super low power embedded NVM technology based on ferroelectric FETs," in *IEDM Tech. Dig.*, Dec. 2016, pp. 11.5.1–11.5.4.
- [14] H. Mulaosmanovic et al., "Novel ferroelectric FET based synapse for neuromorphic systems," in *Proc. Symp. VLSI Technol.*, Jun. 2017, pp. T176–T177.
- [15] A. Aziz et al., "Computing with ferroelectric FETs: Devices, models, systems, and applications," in *Proc. IEEE Design, Automat. Test Eur. Conf. Exhib. (DATE)*, 2018, pp. 1289–1298.
- [16] Z. Wang, S. Khandelwal, and A. I. Khan, "Ferroelectric oscillators and their coupled networks," *IEEE Electron Device Lett.*, vol. 38, no. 11, pp. 1614–1617, Nov. 2017.
- [17] Z. Wang et al., "Experimental demonstration of ferroelectric spiking neurons for unsupervised clustering," in *IEDM Tech. Dig.*, Dec. 2018, pp. 13.3.1–13.3.4.
- [18] X. Chen, X. Yin, M. Niemier, and X. S. Hu, "Design and optimization of FeFET-based crossbars for binary convolution neural networks," in *Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE)*, Mar. 2018, pp. 1205–1210.
- [19] S. F. Mueller, "Development of HfO<sub>2</sub>-based ferroelectric memories for future CMOS technology nodes," Ph.D. dissertation, 2014.
- [20] M. H. Lee et al., "Physical thickness 1.x nm ferroelectric HfZrOx negative capacitance FETs," in *IEDM Tech. Dig.*, Dec. 2016, pp. 12.1.1–12.1.4.
- [21] H. Mulaosmanovic, T. Mikolajick, and S. Slesazeck, "Accumulative polarization reversal in nanoscale ferroelectric transistors," *ACS Appl. Mater. Interfaces*, vol. 10, no. 28, pp. 23997–24002, 2018.
- [22] Google TPU-v2. [Online]. Available: https://cloud.google.com/tpu/docs/system-architecture
- [23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proc. IEEE Conf. Comput. Vis. Pattern Recognit.*, 2016, pp. 770–778.
- [24] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861. [Online]. Available: https://arxiv.org/abs/1704.04861
- [25] B. Gao et al., "Ultra-low-energy three-dimensional oxide-based electronic synapses for implementation of robust high-accuracy neuromorphic computation systems," *ACS Nano*, vol. 8, no. 7, pp. 6998–7004, 2014.
- [26] M. Jerry et al., "Ferroelectric FET analog synapse for acceleration of deep neural network training," in *IEDM Tech. Dig.*, Dec. 2017, pp. 6.2.1–6.2.4.



**YUN LONG** (S'15) received the B.S. degree in microelectronics from Peking University, Beijing, China, in 2014. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the Georgia Institute of Technology, Atlanta, GA, USA.

His current research interests include emerging technology-based machine learning accelerator design, processing-in-memory architecture design, DNN model reduction techniques, and machine learning-based dynamical system modeling.

Mr. Long received the Wusi Fellowship and the National Fellowship in 2008 and 2009, respectively, when he was an undergraduate student at Peking University.



**DAEHYUN KIM** received the B.S. degree in semiconductor systems engineering from Sungkyunkwan University, Suwon, South Korea, in 2016. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the Georgia Institute of Technology, Atlanta, GA, USA.

After graduation, he was with Samsung Electronics Co., Ltd., Hwasung, South Korea, for two years. His current research interests include machine learning accelerator design and volatile/non-volatile memory designs with emerging technologies.



**EDWARD LEE** (GS'18) received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2015, and the M.S. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2017, where he is currently pursuing the Ph.D. degree.

He was a Mixed Signals Design Intern with Nvidia, Santa Clara, CA, USA, in 2016. His current research interests include low-power circuits and power management for energy harvesting and volatile/non-volatile memory designs with emerging technologies.



**PRIYABRATA SAHA** (GS'19) received the B.Tech. and M.Tech. degrees in electronics and electrical communication engineering from IIT Kharagpur, Kharagpur, India, in 2015. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the Georgia Institute of Technology, Atlanta, GA, USA.

Before starting his Ph.D., he was with the ASIC Physical Design Team, Qualcomm, India. His current research interests include machine learning and computer vision.



**BURHAN AHMAD MUDASSAR** (S'14) received the B.S. degree in electronics engineering from the National University of Sciences and Technology, Islamabad, Pakistan, in 2012, and the M.S. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2015. He is currently pursuing the Ph.D. degree in electrical engineering and computer science.

Before pursuing his Ph.D., he was with the Center for Advanced Research in Engineering (CARE), Islamabad. His current research interests include embedded computer vision, smart applications for low-power IoTs, and hardware-software co-design.

Mr. Mudassar received the President's Gold Medal for Best Final Year Project and the Chancellor's Silver Medal, for securing second position in his batch during the B.S. degree. He also received the Fulbright Scholarship for Master's study at Georgia Tech, for the years 2013–2015.



**XUEYUAN SHE** (GS'18) received the B.S. degree in electrical engineering from the University of Virginia, Charlottesville, VA, USA, in 2017. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the Georgia Institute of Technology, Atlanta, GA, USA.

His current research interests include GPU-based machine learning accelerator design and spiking neural networks.



**ASIF ISLAM KHAN** (M'15) received the B.S. degree in electrical and electronic engineering from the Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, in 2007, and the Ph.D. degree in electrical engineering and computer sciences from the University of California, Berkeley, CA, USA, in 2015.

He is currently an Assistant Professor with the Georgia Institute of Technology, Atlanta, GA, USA.



**SAIBAL MUKHOPADHYAY** (S'99–M'07–SM'11–F'18) received the B.E. degree in electronics and telecommunication engineering from Jadavpur University, Kolkata, India, in 2000, and the Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, USA, in 2006.

He is currently a Professor with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA. He has authored or coauthored over 200 papers in refereed journals and conferences. He holds five U.S. patents. His current research interests include the design of energy-efficient, intelligent, and secure systems in nanometer technologies.

Dr. Mukhopadhyay was a recipient of the IBM PhD Fellowship Award in 2004 and 2005, the SRC Technical Excellence Award in 2005, the SRC Inventor Recognition Award in 2008, the IBM Faculty Partnership Award in 2009 and 2010, the National Science Foundation CAREER Award in 2011, and the Office of Naval Research Young Investigator Award in 2012.